Capgemini Data Science Project - Recommendation System

Introduction:

This notebook covers the Hackathon problem.
Problem Statement: Develop a recommendation model from the given 2 datasets.
Datasets given: Electronics Sales & Modcloth Sales.

For recommendation, I have used 3 criteria:
a. Recommendation based on Ratings
b. Recommendation based on Demand of Items/Number of Units Sold
c. Recommendation based on Market Basket Analysis (using Apriori)

Procedure:

  1. Importing Libraries & Preliminary Data Exploration

  2. Data Cleaning

  3. Feature Engineering & Exploratory Data Analysis

  4. Recommending Items based on Ratings
    a. Recommending Items
    b. Recommending Brandwise Items
    c. Latest Trend: Recommending Items over past 2 years
    d. Latest Trend: Recommending Brandwise Items over past 2 years

  5. Recommending Items based on Demand of Items/Number of Units Sold (i.e. Best Selling items)
    a. Recommending Items
    b. Recommending Brandwise Items
    c. Latest Trend: Recommending Items over past 2 years
    d. Latest Trend: Recommending Brandwise Items over past 2 years

  6. Applying Market Basket Analysis (using Apriori algorithm)
    a. Recommending items based on items purchased in past
    b. Recommending categories based on item category purchased in past

  7. Conclusion

1. Importing Libraries & Preliminary Data Exploration

Libraries:

We will be making use of below libraries:

* Pandas: for reading and manipulating data as DataFrames
* Plotly: for data visualisation
* Mlxtend: for making association rules (market basket analysis) between items and categories

Preliminary Data Exploration


Checking the datasets by number of purchases made by every user

Users in Electronics Database

From the above, we see that there are 1157633 users in total in the electronics database.


From the above, we see that 1054823 users have purchased only 1 item.
For the remaining users, we can build a market basket analysis model.
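
The per-user counts discussed above could be obtained roughly as follows. This is a minimal sketch on toy data; the column names `user_id` and `item_id` are assumptions about the dataset layout, and one row per purchase is assumed.

```python
import pandas as pd

# Toy stand-in for the electronics data; column names are assumptions.
electronics = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3, 4],
    "item_id": [10, 11, 10, 12, 13, 10, 14],
})

# Number of purchases made by each user (one row per purchase is assumed).
purchases_per_user = electronics.groupby("user_id").size()

# How many users made exactly 1, 2, 3, ... purchases.
users_by_purchase_count = purchases_per_user.value_counts().sort_index()

total_users = purchases_per_user.shape[0]
single_purchase_users = int((purchases_per_user == 1).sum())
print(total_users, single_purchase_users)
```

Users with a count of 1 carry no co-purchase information, which is why only the remaining users are useful for market basket analysis.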


Users in Modcloth Database

From the above, we see that there are 44783 users in total in the modcloth database.


From the above, we see that 30307 users have purchased only 1 item.
For the remaining users, we can build a market basket analysis model.


2. Data Cleaning

Let us check for missing values for Electronics dataframe

Let us check the types of values in the missing values columns

This column pertains to brands. Some products may come from less recognised brands, so their brand data may be missing.

So we replace missing values with "Other".

Here we can replace the missing values with 'Not Applicable or NA' as this column pertains to gender.

Some products can be unisex or not specified as per gender.
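
The two replacements described above could look like the sketch below. The column names `brand` and `user_attr` (for the gender column) are assumptions, and the toy rows are made up for illustration.

```python
import pandas as pd

# Toy frame; "brand" and "user_attr" are assumed column names.
electronics = pd.DataFrame({
    "item_id": [10, 11, 12],
    "brand": ["Sony", None, "Bose"],
    "user_attr": ["Female", None, None],
})

# Count missing values per column before cleaning.
missing_before = electronics.isnull().sum()

# Unknown brands become "Other"; unspecified gender becomes "Not Applicable".
electronics["brand"] = electronics["brand"].fillna("Other")
electronics["user_attr"] = electronics["user_attr"].fillna("Not Applicable")

# Recheck: every column should now report zero missing values.
missing_after = electronics.isnull().sum()
print(missing_after)
```

The sketch spells out "Not Applicable" rather than the literal string "NA", because `pandas.read_csv` treats "NA" as a missing value by default if the cleaned data is ever saved and reloaded.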

Rechecking for missing values

From the above, we see that there are no missing values left in the Electronics database.


Let us check missing values for Modcloth dataframe

The "brand" column pertains to brands. Some products may come from less recognised brands, so their brand data may be missing.

So we replace missing values with "Other".

3. Feature Engineering and Exploratory Data Analysis

For building a recommendation system, we will be focusing on columns:

Let us merge both these dataframes to form a big consolidated dataframe for electronics as well as modcloth

From the above, we see that we have data from June 1999 till June 2019 i.e. roughly 20 years
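
One way to build the consolidated dataframe and check its date range is sketched below. It assumes both dataframes share the same columns and stacks them with `pd.concat`; the `source` column, the toy rows and the `timestamp` column name are illustrative assumptions.

```python
import pandas as pd

# Toy rows; real data would come from the two cleaned dataframes.
electronics = pd.DataFrame({
    "item_id": [1, 2],
    "user_id": [100, 101],
    "rating": [5.0, 3.0],
    "timestamp": ["1999-06-13", "2018-09-01"],
})
modcloth = pd.DataFrame({
    "item_id": [7, 8],
    "user_id": [200, 201],
    "rating": [4.0, 2.0],
    "timestamp": ["2010-01-05", "2019-06-29"],
})

# Tag the origin so ids from the two sources cannot be confused later.
electronics["source"] = "electronics"
modcloth["source"] = "modcloth"

# Stack the rows of both dataframes into one.
consolidate_db = pd.concat([electronics, modcloth], ignore_index=True)
consolidate_db["timestamp"] = pd.to_datetime(consolidate_db["timestamp"])

# Earliest and latest purchase dates in the combined data.
first_date = consolidate_db["timestamp"].min()
last_date = consolidate_db["timestamp"].max()
print(first_date.date(), last_date.date())
```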

Data Visualisation:

It appears the purchases spiked in the year 2016-17.

This may be due to a boom in the e-retail sector, with competing e-retail companies offering heavy discounts to boost their sales over their competitors.

From the above graph, it appears that for most of the categories, sales are highest in the months of January and December, i.e. around the turn of the year.

Dealing with Bias

Now, when it comes to ratings, there will be cases where bias comes into play, such as:

1. The producer himself/herself may have bought the product and given it a 5 rating, or

2. Some loyal customers may have given the product a 5 rating out of a feeling of loyalty.

To counter that, we will only consider reviews of items that have been purchased more than 50 times. With that many purchases, a handful of biased ratings has much less influence on the average.

We have already prepared dataframe which gives the total quantity sold by every item.

We will merge that dataframe with the consolidate_db dataframe
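
A minimal sketch of this filter is given below. The name `quantity_sold`, the column names and the toy data are assumptions; `consolidate_db` stands in for the consolidated dataframe with one row per purchase.

```python
import pandas as pd

# Toy data: item 1 was bought 60 times, item 2 only 3 times.
consolidate_db = pd.DataFrame({
    "item_id": [1] * 60 + [2] * 3,
    "rating": [4.0] * 60 + [5.0] * 3,
})

# Total quantity sold per item (one row per purchase is assumed).
quantity_sold = (
    consolidate_db.groupby("item_id").size().rename("quantity_sold").reset_index()
)

# Attach the count to every row, then keep items with more than 50 purchases.
with_counts = consolidate_db.merge(quantity_sold, on="item_id", how="left")
MIN_PURCHASES = 50
filtered = with_counts[with_counts["quantity_sold"] > MIN_PURCHASES]
print(filtered["item_id"].unique())
```

In the toy data, item 2 has a perfect 5.0 average from only 3 purchases, and the threshold removes exactly that kind of item from the ranking.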

Let us check the categories for which we need to build the recommendation system

We will check how many major brands are there per category

From the above, we see that:

  1. Computers & Accessories has maximum brands (28 major and other brands)
  2. Wearable Technology has minimum brands (1 major and other brands)
  3. Except for Security & Surveillance and Wearable Technology, all other categories have 5 or more brands
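
A brand count like the one summarised above could be computed as in the sketch below. It assumes `category` and `brand` columns and treats "Other" (the placeholder used for missing brands) as not being a major brand; the toy rows are made up.

```python
import pandas as pd

# Toy data; category and brand values are illustrative only.
df = pd.DataFrame({
    "category": ["Headphones", "Headphones", "Headphones", "Camera", "Camera"],
    "brand": ["Sony", "Bose", "Other", "Canon", "Other"],
})

# Distinct named brands per category, excluding the "Other" placeholder.
major = df[df["brand"] != "Other"]
brands_per_category = (
    major.groupby("category")["brand"].nunique().sort_values(ascending=False)
)
print(brands_per_category)
```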

4. Recommending Items based on Ratings

Let us focus on the overall dataframe

Let us recommend products per category

4.a. Item Recommendation as per ratings received from 1999 to 2019
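
A rating-based recommendation per category could be sketched as follows. The helper `top_rated_items` is hypothetical, as are the column names; it ranks items by mean rating within each category and assumes low-volume items have already been filtered out.

```python
import pandas as pd

# Toy ratings; one row per rated purchase.
ratings = pd.DataFrame({
    "category": ["A", "A", "A", "A", "B", "B"],
    "item_id": [1, 1, 2, 2, 3, 3],
    "rating": [5, 4, 3, 3, 2, 4],
})

def top_rated_items(df, n=5):
    """Return the n best-rated items in each category (hypothetical helper)."""
    mean_ratings = (
        df.groupby(["category", "item_id"], as_index=False)["rating"].mean()
    )
    ranked = mean_ratings.sort_values(
        ["category", "rating"], ascending=[True, False]
    )
    # head(n) on the grouped frame keeps the first n rows of every category.
    return ranked.groupby("category").head(n).reset_index(drop=True)

top_items = top_rated_items(ratings, n=1)
print(top_items)
```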

4.b. Brandwise Item Recommendation as per ratings received from 1999 to 2019

Let us recommend products as per brand

We will build a function which returns top 5 brands in every category and the top products for a particular category for a particular brand in descending order of their rating
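
Such a function could be split into two small helpers, as in the sketch below. The names `top_brands` and `top_products` are hypothetical, and ranking brands by their mean rating is an assumption, since the ranking rule for brands is not stated above.

```python
import pandas as pd

def top_brands(df, category, n=5):
    """Brands in `category`, best mean rating first (ranking rule assumed)."""
    in_cat = df[df["category"] == category]
    return (
        in_cat.groupby("brand")["rating"].mean()
        .sort_values(ascending=False)
        .head(n)
    )

def top_products(df, category, brand):
    """Items of one brand in one category, in descending order of mean rating."""
    subset = df[(df["category"] == category) & (df["brand"] == brand)]
    return subset.groupby("item_id")["rating"].mean().sort_values(ascending=False)

# Toy ratings; brand and category values are illustrative only.
ratings = pd.DataFrame({
    "category": ["Camera"] * 5,
    "brand": ["Canon", "Canon", "Canon", "Nikon", "Nikon"],
    "item_id": [1, 1, 2, 3, 3],
    "rating": [5, 5, 3, 4, 4],
})
brands = top_brands(ratings, "Camera")
canon_items = top_products(ratings, "Camera", "Canon")
print(brands.index.tolist(), canon_items.index.tolist())
```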

Let us focus on the trend for the last 2 years

4.c. Latest Trend: Item Recommendation as per ratings received over the last 2 years
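
Restricting the data to the last 2 years could be done as in the sketch below. It assumes a datetime `timestamp` column and measures the window back from the most recent purchase in the data rather than from today's date.

```python
import pandas as pd

# Toy data with one purchase in each of three different years.
df = pd.DataFrame({
    "item_id": [1, 2, 3],
    "rating": [5, 4, 3],
    "timestamp": pd.to_datetime(["2015-03-01", "2017-08-15", "2019-06-01"]),
})

# The window ends at the latest purchase and reaches back two years.
latest = df["timestamp"].max()
cutoff = latest - pd.DateOffset(years=2)
recent = df[df["timestamp"] >= cutoff]
print(recent["item_id"].tolist())
```

The same rating-based ranking can then be applied to `recent` instead of the full dataframe.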

4.d. Latest Trend: Brandwise Item Recommendation as per ratings over the last 2 years

Let us recommend products as per brand

4.e. Collaborative Filtering Recommendation System

Now that we have completed recommending items based on the ratings, we can also recommend items based upon the sales.

We can call these the Best Selling Items.

5. Recommending Items based on Demand of Items/Number of Units Sold (i.e. Best Selling items)

Let us group the items based on the units sold
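
The grouping could be sketched as follows, assuming one row per unit sold and the column names `category` and `item_id`; `units_sold` is an illustrative name.

```python
import pandas as pd

# Toy sales log; each row is one unit sold.
sales = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "item_id": [1, 1, 2, 3, 4, 4],
})

# Units sold per item within each category.
units_sold = (
    sales.groupby(["category", "item_id"]).size()
    .rename("units_sold").reset_index()
)

# Best seller of every category: sort by volume, keep the first row per group.
best_sellers = (
    units_sold.sort_values(["category", "units_sold"], ascending=[True, False])
    .groupby("category").head(1)
    .reset_index(drop=True)
)
print(best_sellers)
```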

5.a. Best Selling Items from 1999 to 2019

5.b. Brandwise Best Selling Items from 1999 to 2019

Now let us focus on the latest trend i.e. from the past 2 years

5.c. Latest Trend: Best Selling Items from 2017 to 2019

5.d. Latest Trend: Brandwise Best Selling Items from 2017 to 2019

6. Applying Market Basket Analysis (using Apriori algorithm)

For market basket analysis, we will consider only those cases where the customers have purchased more than one item.

This will allow us to link a user's likelihood of buying an item to the items purchased before.

For Market Basket Analysis / Apriori, the higher the lift value, the stronger the association between the antecedent and the consequent item: a lift above 1 means the consequent is bought alongside the antecedent more often than its overall popularity would predict. We can therefore recommend the consequent item when the user selects the antecedent item.
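
To make the lift measure concrete, the sketch below computes support, confidence and lift by hand for single-item rules on a few made-up baskets. It is a plain-Python illustration of the definitions, not the Apriori implementation from Mlxtend, and it does no pruning of infrequent itemsets.

```python
from itertools import permutations

# Each set is one user's basket; item names are made up.
baskets = [
    {"camera", "tripod"},
    {"camera", "tripod", "bag"},
    {"camera", "bag"},
    {"headphones"},
    {"tripod", "camera"},
]
n = len(baskets)

def support(items):
    """Fraction of baskets containing every item in `items`."""
    return sum(items <= b for b in baskets) / n

rules = []
all_items = sorted(set().union(*baskets))
for antecedent, consequent in permutations(all_items, 2):
    joint = support({antecedent, consequent})
    if joint == 0:
        continue  # the pair never occurs together, so there is no rule
    confidence = joint / support({antecedent})
    lift = confidence / support({consequent})
    rules.append((antecedent, consequent, joint, confidence, lift))

# Strongest associations first.
rules.sort(key=lambda r: r[4], reverse=True)
for rule in rules:
    print(rule)
```

Here the rule camera → tripod has a lift above 1, so a tripod is a reasonable suggestion for a camera buyer, while tripod → bag has a lift below 1 and would not be recommended.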

6.a. Item Recommendation per Category based on Market Basket Analysis (Apriori)


Now that we have established relations between products, we will check whether relations exist between categories, i.e. if a person has bought an item from category A, we will try to find the likelihood that they will buy an item from category B.

For this purpose, we cannot use the consolidated dataframe, as the number of transactions in the electronics database is far greater than in modcloth

We have to establish the basket analysis association rules separately for these 2 dataframes

We have to drop users who have made only 1 transaction as we cannot use these cases for market basket analysis

We will drop the users with only 1 transaction
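
Dropping single-transaction users and arranging the rest into baskets could look like the sketch below. The column names are assumptions; the resulting boolean table (one row per user, one column per category) is the one-hot layout that Mlxtend's `apriori` expects as input.

```python
import pandas as pd

# Toy transactions; user 2 has only one transaction.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "category": ["Camera", "Tripod", "Camera", "Camera", "Bag", "Camera"],
})

# Number of transactions of the user on each row.
txn_counts = df["user_id"].map(df["user_id"].value_counts())

# Keep only users with more than one transaction.
multi = df[txn_counts > 1]

# One row per user, one boolean column per category bought at least once.
basket = pd.crosstab(multi["user_id"], multi["category"]) > 0
print(basket)
```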

6.b. Category Recommendation based on Market Basket Analysis (Apriori)

Electronics Database

Modcloth Database

7. Conclusion:

We have analysed the data, cleaned it and built additional features from the given data, which helped us build our recommendation system.

Recommendation systems built in this notebook are based on the below parameters:

i. Ratings of the item & brand

ii. Best-selling items/brands

iii. Association Rules using Market Basket Analysis/Apriori algorithm for items & categories